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Abstract 

Cascades of Poisson processes are probabilistic models for spatio-temporal phe¬ 
nomena in which (i) previous events may trigger subsequent events, and (ii) both the 
background and triggering processes are conditionally Poisson. Such phenomena are 
typically “data rich but knowledge poor”, in the sense that large datasets are avail¬ 
able yet a mechanistic understanding of the background and triggering processes which 
generate the data are unavailable. In these settings nonparametric estimation plays 
a central role. However existing nonparametric estimators have computational and 
storage complexity 0(N 2 ), precluding their application on large datasets. Here, by 
assuming the triggering process acts only locally, we derive nonparametric estimators 
with computational complexity O (N log N) and storage complexity O(N). Our ap¬ 
proach automatically learns the domain of the triggering process from data and is 
essentially free from hyperparameters. The methodology is applied to a large seismic 
dataset where estimation under existing algorithms would be infeasible. 


1 Introduction 

Several important real-world processes can be conceptualised as a series of events occurring 
in time and space, possibly spontaneously, where each event has the capacity to trigger 
subsequent events. For example, events in epidemiology correspond to the infection of an 
individual by a transmissible disease; the infected individual may then go on to cause subse¬ 
quent infections. In many settings it is possible to obtain large, detailed data that catalogue 
the occurrence of events but do not specify individual cause-effect relationships. For exam¬ 
ple, in epidemiology it is generally not possible to identify the cause of a specific infection. 
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the ARC Centre of Excellence for Mathematical and Statistical Frontiers. 
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Cascades of Poisson processes (COPP) have been successfully applied to facilitate inference 
and prediction in this setting. COPP were originally developed by the geological community 
in the 1980s for statistical modelling of earthquake/aftershock data (Ogata, 1988). Recent 
and varied applications of COPP models have included modelling the dynamics of retaliatory 
gang violence (Mohler et al, 2011), terrorist activity (Porter and White, 2012; White et al, 
2014), success of commercial book sales (Deschatres and Sornette, 2005), military conflicts 
(Blundell et al, 2012), disease at the cellular level (Sornette et al., 2009), “retweet cascades” 
on the social network Twitter and editing patterns on Wikipedia (Simma and Jordan, 2010). 

In mathematical terms, COPP may be viewed as branching processes with immigration. 
Inference in general branching processes is challenging since complex, nonlinear, multidi¬ 
mensional models do not generally admit closed form solutions for estimators (Ogata, 1988). 
Moreover COPP models can produce multimodal or very flat log-likelihood functions, pre¬ 
cluding reliable numerical algorithms. To overcome these difficulties, Zhuang et al. (2002) 
viewed inference as an incomplete data problem and adapted the Expectation-Maximisation 
(EM) algorithm (Dempster et al, 1977) to this setting. In brief, the information about which 
events arose from the background, which events were triggered and what those triggers were 
(collectively the “branching structure”) is required to specify the likelihood, but is unob¬ 
servable. By describing the branching structure probabilistically, the EM algorithm seeks 
instead to maximise the expected likelihood, which can be achieved for certain parametric 
COPP models (see Veen and Schoenberg (2008) for examples). 

In the absence of a mechanistic understanding of either the background or triggering 
processes, nonparametric estimators play an important role (Zhuang et al, 2002; Marsan and 
Lenglin, 2008). However Sornette and Utkin (2009) showed that these estimators can exhibit 
large bias at small-to-moderate sample sizes (.N < 10 4 ); it is therefore important to integrate 
as many samples as possible into nonparametric inference. Unfortunately existing algorithms 
scale poorly, with computational and storage complexity of existing approaches being 0(N 2 ). 
For example, on earthquake data Marsan and Lenglin (2008) reported using a sample size of 
N = 6,109 due to computational limitations, whereas seismological datasets now regularly 
exceed N > 10 5 (Hutton et al, 2010). The “big data” now available in many application 
areas for COPP models delivers a pressing need to develop scalable nonparametric estimation 
procedures. 

In this paper we develop accelerated nonparametrics for COPP. Our approach is based on 
truncation of kernel density estimators (KDEs) to include only terms which are temporally 
and spatially local to an event of interest. The domains of the KDEs are chosen adaptively 
based on nearest neighbour distances, rendering the approach essentially free from hyperpa¬ 
rameters. The complete algorithm, described in Section 2, enjoys computational complexity 
of just 0(N log N) and requires 0(N ) storage. Using seismological data, we demonstrate in 
Section 3 that our estimators facilitate the analysis of larger datasets than was previously 
possible and provide an empirical assessment of performance in this setting. Section 4 closes 
with a discussion of challenges and extensions to our methodology. 
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Figure 1: An illustration of a cascade of Poisson processes (in lxl dimensions). Left: 
Background events (grey) arise according to a Poisson process with conditional intensity 
function y, whereas triggered events (white) are caused by a previous event (P, x l ), indicated 
by an edge in the figure, according to another Poisson process with conditional intensity 
function g(», (P, x 1 )). The set F of edges is known as the “branching structure” and is 
statistically equivalent to a random forest. Right: The “trigger function” g is strictly zero 
for t ^ ti, so that events observe a causal order in time. Here contours portray a local, 
isotropic trigger function, but isotropy is not required for the derivations in this paper. 

2 Methods 

We proceed as follows: Sections 2.1 and 2.2 set up notation and formally define the COPP 
model class. Section 2.3 surveys existing algorithms and nonparametric estimators for COPP. 
Section 2.4 contains the core of our methodology for scaling inference to large datasets. 


2.1 Setting and notation 

Consider a metric space (X,d) formed as the finite product Xf =1 (W, d t ) where each co¬ 
ordinate (A), di) is assumed to be a locally compact metric space. Here d is a product 
metric; in this paper all product metrics are induced by the Euclidean norm || • (| 2 , so that 
d(x,y) = \\(di(xi,yi),... ,d p (x p ,y p )) || 2 for x,y e X. Consider a simple point process X on 
[0, oo) x X adapted to a filtration F t . For clarity we equip the time domain [0, oo) with the 
Euclidean metric, though an arbitrary locally compact metric may be used. Write B$(x) cz X 
for the open ball of radius 5 > 0 centred on x e X. A measure A on [0, oo) x X, known as 
the conditional intensity of the point process X, is given by 


A (t, x) 


P {exactly one event occurs in [t, t + At) x B$(x) \ Ft) 
At,s[o At x 5 


( 1 ) 


In this paper Ft will be the natural filtration; the history up to time t, i.e. F t = {(P, x l ) : t l < 
t} where P is the time and x l are additional coordinates associated with event i (e.g. spatial 
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coordinates). It can be shown that the finite-dimensional distributions of X are characterised 
by the conditional intensity function A; in this paper inference and prediction for A" is carried 
out entirely through inference for A. (We assume X is such that the associated measure A 
is finite on every compact subset of [0, go) x X and contains no atoms, so that Eqn. 1 is 
well-defined.) 

2.2 Cascades of Poisson processes 

This paper concerns the class of point processes X known variously as branching processes 
with immigration, COPP models (Simma and Jordan, 2010), “self-exciting point processes” 
(Mohler et al, 2011), “epidemic type aftershock sequences” (Ogata, 1988) and “Hawkes 
processes” (Blundell et al, 2012). Such models are characterised by a conditional intensity 
A which is expressed as a superposition of point processes: 

A(t, x) = n(t, x) + Yj 9((t, *), (C **)) (2) 

Here /i denotes a (possibly time-varying) background intensity and each g(», (t\ x 1 )) rep¬ 
resents a “triggering” intensity, giving rise to offspring from the trigger event ( t l ,x l ). In 
this formulation, all events (t l ,x l ) are able, in principle, to trigger offspring at future times 
t > f] the generative model therefore can give rise to “cascades” of events, which explains the 
COPP nomenclature (Fig. 1). Estimation for COPP models is equivalent to estimation for 
the background (//) and triggering (g) intensities. In an application to earthquake data be¬ 
low, we model a time-independent background intensity /i{t, x) = fi(x), but for completeness 
we present the general case of time-varying background intensities. 

2.3 Stochastic declustering 

In applications of COPP models, data Ft obtained over a time interval [0, T] do not them¬ 
selves specify which events arose spontaneously, which events were triggered by previous 
events and what these triggers actually were (collectively the “branching structure”). Such 
information can be though of as a forest F on the events in Ft and is illustrated in Fig. 
1. Together T> = {Ft,F} can be thought of as complete data; inference for each of fi and 
g based on P is a straight forward density estimation problem. In applications where the 
branching structure F is unobservable, inference based only on Ft can proceed via the EM 
algorithm (Zhuang et al, 2002). In the case where fig, gg are parametrised by 6 (Veen and 
Schoenberg, 2008), the EM algorithm proceeds by alternating between taking an expectation 
over random forests F (“E-step”) and maximising the expected log-likelihood Ej? \og{jp(T>\6)) 
over parameters 6 (“M-step”). This procedure, which is sometimes referred to as “stochas¬ 
tic declustering”, also exists in nonparametric flavours (Zhuang et al, 2002; Marsan and 
Lenglin, 2008). 

Whilst theoretically elegant, stochastic declustering is highly computational; for each 
event i it is required to exhaustively enumerate all possible triggers of i, along with their 
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associated probabilities of being the actual trigger. Consequently the M-step must be per¬ 
formed with 0(N 2 ) weighted samples. To reduce this computational burden, Moliler et al. 
(2011) developed a Monte Carlo alternative which approximates the E-step by sampling a 
single forest from the conditional distribution F\Ft, h, g, then substituting this into the com¬ 
plete data likelihood to facilitate the M-step based on only O(N) unweighted samples. As 
a consequence the algorithm does not converge to a single pair of conditional intensities /i, 
g, but instead samples from a set of plausible intensities. This algorithm has demonstrated 
good performance in practice (Mohler et a/., 2011) but does not currently share the same 
theoretical guarantees as the EM approach. Note that the E-step of both existing approaches 
requires 0(N' 2 ) computational and storage complexities. In this paper we develop a princi¬ 
pled methodology, based on the EM algorithm, that requires only 0(N log N) computational 
complexity and O(N) storage complexity. 

2.4 Accelerated nonparametrics 

Below we outline the main component of our proposed methodology, that targets the compu¬ 
tational and storage complexity of both the E-step and the M-step in stochastic declustering. 
Background events that are distant (either spatially or temporally) to (t, x ) are likely to con¬ 
tribute little to the value of any estimate £i(t,x) of the background intensity at (t,x). Our 
proposal is therefore to estimate intensity functions using only local information. A naive 
approach would be to restrict attention to events within a ball Bn(t, x) of fixed radius R 
about the point (t, x). However, in areas where background events are sparse, this ball might 
not contain enough points to enable an accurate approximation of the background intensity 
at (t, x). Noting that the all-nearest-neighbours problem can be solved with time complexity 
C ) (A r log N) and storage complexity 0(N ) (Vaidya, 1989), it is computationally appealing to 
choose R adaptively such that the ball includes a constant number L of nearest neighbours. 
Here L should be taken sufficiently large that approximation error, at the level of derived 
quantities of interest, is negligible. Full details are provided below. 

It will be convenient to relabel the time variable t as the first coordinate, so that our data 
Ft can be represented by a N xp matrix X whose entries in the first column i represent the 
time of the ?'th event, and the remaining columns Xi^, ..., Xi iP represent additional covariates 
associated with event i (typically spatial coordinates). Write a(i,j) for the row index in X 
of the jitli closest event to event i as measured by the standardised Euclidean distance 


B uixi,* i x j i ) 



(3) 


where x^. = (x^i,... ,x itP ) and cr*, is a characteristic length scale for the covariates x.^ = 
(xi^k, ■ ■ ■, XN,k) T that must be specified. We adopt the convention that a(i, 1) = i and 
consider 1 ^ j ^ L. Our approach is nonparametric and will be presented here using 
Gaussian kernels. We will write 0(cc|ra,S) for the density of a multivariate Gaussian with 
mean m and covariance S. evaluated at the point x. Then a simple KDE, based on data 
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points x ijt e M p , is given by 


N 

\{x) « 2 D cr (x i) . 1 £C a(ij A-),.)diag(cr)). (4) 

i=l 

Here K is a hyperparameter that controls kernel bandwidth; an optimal choice, in terms of 
minimising mean square error (MSE), is given by K ~ 7V 4 /( 4+p ', with corresponding MSE 
~ jV _4 /( 4+p ) (see e.g. Li and Racine, 2007). 

Remark 1: The use of a vanishing-tailed kernel implies that influences are considered 
to act locally. We note that our methodology below also applies to non-Gaussian kernels, 
provided that they are local in this sense. 

Remark 2: The characteristic length scales cr that must be specified will not affect in¬ 
ference when L is sufficiently large and are therefore not hyperparameters per se. Indeed, 
when L is increased we allow more distant events to be considered as possible causes, but 
these will have negligible contribution to any sensible KDE. 

Remark 3: The simple, illustrative estimate in Eqn. 4 does not respect the COPP model 
structure in Eqn. 2. Below we accelerate computation in stochastic declustering, directly 
exploiting the COPP model structure. 


2.4.1 Accelerated E-step 

For accelerated computation the statistical is encoded by a N x L matrix P with entries 


Pj.i = P[event i arose from the background] 

P.j,i = P[event i was triggered by event a(i, /)]. 


In particular we assume that P is a stochastic matrix (unit row sums), i.e. the cause of an 
event always belongs to its L nearest neighbours. In practice it will be necessary to choose 
L sufficiently large that results are approximately invariant to further increase in L. The 
matrix P characterises the distribution F\Ft, /i, g over forests F given data Ft and intensity 
functions fi , g. Indeed, we have that 


Pi, i = 


/^(®i,«) 

A(*i,.)’ 


Pi,i = 


A(«Ci,«) 




(5) 


where 


L 

A(flJi ) «) = /^(^i,*) T ^ 1 ' > l}- 

1=1 

By restricting attention to the L nearest neighbours, this accelerated E-step requires 0(N ) 
complexity and storage, compared to the 0(N 2 ) of existing approaches. 
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2.4.2 Accelerated M-step 

Conditional on P, we define estimates 


Tl p i,i x i,k 

Th Pi, i 


m h = 


Tl p i.i(xi,k z m k) 

Th Pv 


X\2 


X \2 

Oi. = 


( 6 ) 


for the mean and standard deviation of the hth covariate corresponding to background events. 
Fix an integer K\ such that 2 ^ K\ ^ L and write /3(i) for the index in X of the A']th 
closest background event to event i. Ah will be used to adaptively select an appropriate 
kernel bandwidth and can be elicited following the optimal K\ ~ (2^1i Pyi) 4 ^ 4+P ^ rule. 
Now f3(i) is an unknown quantity with uncertainty encoded by the entries in P ; bearing this 
in mind we can define 


dj [D a x (xi ,, j# )] 


(7) 


where the expectation is taken over all possible assignments of background and trigger events, 
weighted according to P. Then df- is an estimate for the (standardised) distance from event 
i to the jth nearest background event. These values may be obtained exactly (Zhuang et 
al, 2002) or numerically using Monte Carlo estimation (Mohler et al., 2011), in the latter 
case sampling a branching structure F from F\P T , y, g and then computing D a x (x^,, */ 3 (i),.) 
directly based on F. In experiments below we took the latter approach. 

We estimate the background intensity at event Xi t , as 


N L 

% 1] P i^ X l X i*i d f' dia §( CrX )) « ( 8 ) 

0 =1 1 =1 


When L = N, the second approximation becomes exact, but for L < N, the latter expression 
provides a relaxation of the former, with favourable computational and storage complexity. 

The same principle of adaptive truncation based on nearest neighbour distances is used 
in the differential domain to approximate the triggering function g. We construct a NL x p 
matrix Y whose entry in the L(i — 1) + Zth row and fcth column is given by x,^ — x a (i n 
i.e. the p-dimensional vector that joins event a(i, l ) to event %. We refer to these vectors as 
“A-events”. Let Q be a NL x 1 vector with L(i — 1) + jth entry P t j for j # 1 and 0 for 
j = 1. Conditional on Q, we define 


£ji= 1 QiV 


m h = 


i.k 


Til Qi 


TllQi 


Y \2 _ 
Cf u ) — 


Y\2 


(9) 


to be the sample mean and standard deviation of A-events. 

Write y(z, l ) for the index in Y of the Ah closest A-event to A-event i, as measured 
by D a (y i t ,yj .) where is the standard deviation of y m k = (yi t k, ■ ■ ■, y P ,k) T , with the 
convention that y(i, 1) = i. Also write 5(i) for the index in Y of the A^th closest cause- 
effect event to the A-event i. As before, the optimal bandwidth can be elicited following the 
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Algorithm 1 Accelerated nonparametrics for cascades of Poisson processes. 


Initialise: 

7: 

Estimate using Eqn. 8 

1: Compute and cache Y, a(«, •), 7 («, •) 

8: 

Compute m Y , cr Y using Eqn. 9 

2: Initialise P^ and Q ^ 

9: 

Compute d Y using Eqn. 10 

3: n <— 1, e <— oo 

10: 

Estimate g(x i: ,,Xj t ,) using Eqn. 11 

Stochastic declustering: 

E-step: 

4: while e > 10~ 2 do 

11: 

Re-estimate P ^ and Q ^ using Eqn. 

M-step: 


5 

5: Compute m x , cr x using Eqn. 6 

12: 

n <— n + 1, e <— \\P ^ — P^ -1 / i/iV 

6: Compute d x using Eqn. 7 

13: 

end while 


K 2 ~ <3i) 4/(4+p) rule. Now <5(z) is an unknown quantity with uncertainty encoded by 

the entries in P] bearing this in mind we can define 

dj = E F [D a Y(yi^,y s ^)\ (10) 

where the expectation is taken over all possible assignments of background and trigger events, 
weighted according to P. These values may again be obtained exactly or numerically using 
Monte Carlo estimation, and we did the latter. 

We estimate the trigger intensity at x it ,, as contributed by event a(i,j), as 

L 

% ^ Q'y(L(i—l)+j,l) ( t > {yL(i—l)+j,»\y'Y(L(i—l)+j,l),»i ))• (H) 

1=1 

Again, when L = N, Eqn. 11 is exactly the standard KDE based on all data points y,,, but 
for L < N, Eqn. 11 provides a relaxation of this estimator with favourable computational 
and storage complexity. 

This accelerated M-step has computational complexity 0(N log N), compared to the 
0(N 2 ) complexity of existing approaches. Moreover since the L-nearest neighbour problem 
can be solved using 0(N ) storage, we also achieve 0(N ) storage requirements, compared to 
the 0(N 2 ) of existing approaches. 

The accelerated algorithm presented above, which is somewhat tricky to derive, is the 
main contribution of this paper. Complete pseudocode is provided in Alg. 1, which proceeds 
by initialising the distribution over forests, as encoded by P = P (0) . For all experiments in 
this paper we used a uniform distribution such that = 1/2 and = 1/(2 L) otherwise. 
Then entries of P were set to zero according to whether Xjj > Xj.\ and P was row-normalised 
to ensure each row defines a probability distribution. The algorithm is terminated when 
consecutive iterations change the distribution over branching structures F by less than ICC 2 
in total variation distance. 

Remark Jp. As is common for EM-type algorithms, formal convergence analysis is math¬ 
ematically intractable; the estimator need not converge to a global optimum and estimator 
performance must be assessed empirically. 







3 Results 


We proceed as follows: Section 3.1 describes a typical application of COPP models arising in 
seismology. Section 3.2 investigates empirically the computational advantages of the accel¬ 
erated methodology in this setting. Finally Section 3.3 compares data-driven nonparametric 
estimation with model-based estimation via predictive likelihood scores. 


3.1 Earthquake data 

We obtained data on N = 564, 750 earthquakes occurring in a rectangular area around Los 
Angeles between longitudes 122°W and 144°W and latitudes 32°N and 37°N (733 km x 
556 km) between January 1st, 1932 and December 31th, 2012 (Hutton et al, 2010). Data, 
which are available for download from http://www.data.scec.org/, include occurrence times 
and locations based on measurements from % 400 sensors positioned throughout Southern 
California. 

Decades of geophysical research have led to a deep understanding of the statistical prop¬ 
erties of earthquake aftershocks (Ogata, 1988). A widely used parametrisation for the trigger 
function is the “epidemic-type aftershock model” 


g e (At 1 Ax,m i ) 


K 0 e a ( mi ~ M °) 

(At + c) 1+aj (||Aa;||| + d) 1+ p 


( 12 ) 


Here the A prefix denotes coordinates relative to the z'tli event (t l . x l ) and m, is its associated 
magnitude. Based on Eqn. 12, Veen and Schoenberg (2008) performed inference for 6 = 
{K 0 ,a,c,uj,d, p) using a subset of N = 6,796 events from the Southern California dataset 
post-1984 (where data are considered complete above magnitude M 0 = 3), based on a 
piecewise constant partition of X = M 2 according to 8 regions of geological fault activity. 
Given that the approach of Veen and Schoenberg (2008) is heavily constrained by Eqn. 
12, it is compelling to see whether nonparametric, data-driven models can compete with 
this parametric benchmark. Our nonparametric estimators are constructed as mixtures of 
Gaussians and are able, in principle, to approximate non-radial intensities such as Eqn. 12 
with arbitrary precision. Previously it had not been possible to perform this comparison, 
since nonparametric estimators were computationally restricted to N % 10 3 samples, which 
Sornette and Utkin (2009) argued was insufficient for robust estimation. 

In experiments below we follow the seismology literature by assuming a time-independent 
background intensity p(t,x) = p(x). We did not include magnitude as a coordinate of A, 
since aftershock magnitude need not be similar to mainshock magnitude. Characteristic 
length scales, required for defining the L nearest neighbours in Eqn. 3 but not hyperparam¬ 
eters per se, were taken to be 1 day, 0.1° latitude and 0.1° longitude. Boundary effects were 
not modelled. 
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Figure 2: Selecting the adaptive truncation parameter L. Top: Earthquake data available for 
the years 1960-1965, with the 34th parallel indicated as a horizontal line. Bottom: Estimated 
background intensity // of earthquakes along the 34th parallel. 

3.2 Accelerated nonparametrics 

The proposed approach has computational and storage requirements which are linear in 
the adaptive truncation parameter L , with larger L leading to smaller approximation error. 
In order to inform our choice of L we plotted the estimated background intensity fx for a 
hypothetical earthquake along the 34th parallel, varying L. Fig. 2 suggests that results, 
based here on the N = 1509 events recorded between 1960 and 1965, are approximately 
independent of L when L ^ 10; we therefore took L = 10 for all subsequent experiments. 

Our methodology aims to relax current limitations on the size and scope of nonpara- 
metric analyses; to test this we implemented both the proposed and existing algorithms on 
the same platform (MATLAB R2015a) and performed calculations using a single 2.53GHz 
processor and 3GB of RAM. Specifically, we compared against exact (Zhuang et al., 2002) 
and approximate (Mohler et al . , 2011) stochastic declustering. To ensure fair comparison, 
all algorithms were based on the same Gaussian kernel with fixed (non-adaptive) bandwidth 
parameters K\ = K 2 = 10 and identical total variation stopping rules were used. The 
sampling-based algorithm of Mohler et al. (2011) does not converge to a unique estimate; we 
therefore terminated this algorithm after 10 iterations in situations where convergence in to¬ 
tal variation was not achieved. The threshold 10 was chosen since our accelerated estimator 
typically required fewer than 10 iterations to converge in the above sense. 

A dataset was constructed based on the first N events from the starting point of January 
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Figure 3: Computational times. [Here asterisks are used to denote the final point at which 
execution of the algorithm was permissible under storage limitations. The proposed “Accel¬ 
erated” approach returned an “Out of Memory” error at IV = 10 5 .] 


1st, 1932. We examined the computational time required for termination of each of the 
three algorithms, while increasing N. Fig. 3 demonstrates that our accelerated approach is 
significantly quicker than both exact (Zhuang et al, 2002) and approximate (Mohler et al, 
2011) stochastic declustering. The method of Zhuang et al. (2002) was heavily constrained 
by 0(N 2 ) storage and quickly ran out of memory, whereas Mohler et al. (2011) was able 
to go further, with storage O(N), but was limited by CPU time. In contrast, the proposed 
methodology is able to quickly scale to much larger sample sizes. We emphasise that, whilst 
it is surely possible to improve each implementation for a given N, it remains true that 
the proposed methodology enjoys favourable computational 0(N log N) and storage 0(N ) 
complexities, with negligible loss of accuracy compared to the 0(N 2 ) competing approaches. 

3.3 Parametric versus nonparametric 

Our accelerated methodology allows us to investigate, for the first time, whether nonparamet¬ 
ric estimators based on large datasets can be competitive with domain-specific parametric 
models. We initially took the parametric model of Veen and Schoenberg (2008), based on 
decades of geological research, as a proxy for the true data-generating intensities in the 
Southern California dataset. In contrast no geological knowledge entered into the nonpara¬ 
metric estimators. We then compared our nonparametric estimator g against the parametric 
ge of Eqn. 12, where the former was based on the N = 17, 891 events occurring in 1984 and 
the latter based on parameters 6 reported in Veen and Schoenberg (2008). Fig. 4 shows that 
g approximately recovered the correct support of gg, with the spatial marginal being more 
accurate than the time marginal. This rough agreement give confidence that the proposed 
nonparametric estimators are indeed targeting the correct data-generating intensities. In 
order to probe robustness of these conclusions, we repeated the procedure using data on 
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Figure 4: Comparison of triggering frequencies, proportional to g , inferred using the 1984 
data and (a) the parametric method of Veen and Schoenberg (2008), which uses geological 
knowledge, and (b) the proposed nonparametric methodology. [The parametric estimator 
depends on the magnitude m t of the mainshock; for visualisation we chose a magnitude 
which equates total area under both curves.] 


years 1985-1990; in each case a similar level of approximation was observed between g and 
g e (see Fig. SI). 

Encouraged by accurate recovery of the trigger function, we then assessed the predic¬ 
tive performance of nonparametric methods. The standard approach to testing earthquake 
models was established by the working group for the development of Regional Earthquake 
Likelihood Models (RELM) in 2001 and is reviewed in Bray and Schoenberg (2013). In 
brief, each competing model is required to estimate the number of earthquakes in each of a 
number of spatio-temporal bins, where the number of events in each bin is assumed to follow 
a Poisson distribution with intensity parameter equivalent to the forecasted rate. The sim¬ 
plest performance measure in this setting is known as the L-score, that evaluates the joint 
probability of held-out data according to the proposed model, computed as a product of 
independent Poisson probabilities. Using both our accelerated nonparametric estimator and 
the parametric model of Veen and Schoenberg (2008), we attempted to predict earthquakes 
for each of the 31 days of December in each of 2010, 2011 and 2012, given the previous 
7 days’ events. Our nonparametric approach was based on a large dataset containing the 
N = 90, 601 events recorded from 2003 to 2009. Given estimated intensity functions, the 
7 days prior to each day in December were used to construct a predictive intensity over 
the domain of the held-out data. By computing the predicted number of events occuring 
in each of the 30 regions whose boundaries are defined by integer values of latitude and 
longitude, we are able to quantify predictive performance under the L-score, such that larger 
values represent better performance. Results showed that accelerated nonparametric meth¬ 
ods were competitive with, but not superior to, model-based prediction (log-L = 3.34 versus 
log-L = 3.92 respectively). Deconstructing this result, we found that nonparametric meth¬ 
ods tended to systematically under-estimate the reproductive ratio R (the expected number 
of offspring from any given event) relative to the parametric estimator. Fig. S2 compares 
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Figure 5: A typical sample from an inferred distribution over branching structure, or forest F, 
for earthquake data from 1984 with latitude between 32°N and 37°N. [Each point corresponds 
to one recorded seismic event. Edges are used to join mainshocks (red) to their aftershocks 
(blue). Size corresponds to earthquake magnitude, which was not used here to estimate the 
branching structure. To improve visualisation, we do not display latitude information in this 
figure.] 


estimates for R based on data from each of the years 1984-1990. Due to under-estimation of 
R , more events were deemed to be background and were not foreseen, explaining the lower 
L-scores. 

In addition to estimation of intensity functions, in principle one can also estimate the 
branching structure F. Fig. 5 displays a typical point estimate for branching structure. 
However, identification of F from occurence data is fundamentally extremely challenging, as 
pointed out by Sornette and Utkin (2009). Indeed, we observe in Fig. 5 that F contains 
several mainshock-aftershock links that correspond to a time delay of several weeks; this 
would typically be considered unrealistic on geological grounds and supports the intuition 
that it is extremely challenging to achieve accurate estimation of branching structure. 
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4 Discussion 


Point processes that admit COPP structure arise in many topical scientific analyses, where 
typically a precise understanding of the background (//) or triggering (g) processes is cur¬ 
rently unavailable. For example it is unclear how to formulate a parametric model for the 
triggering process underlying crime waves, or for the spread of infection through a human 
population. The nonparametric methods described here have potential to provide new in¬ 
sights in such systems. In this contribution we accelerated computation for these estimators: 
Using adaptive truncation based on nearest neighbour distances, we were able to attain 
computational complexity 0(N log N) and storage complexity O(N), with negligible loss of 
accuracy compared to existing 0(N 2 ) procedures. Using seismology data as a test-bed we 
demonstrated a practical increase in algorithmic efficiency that allowed the integration of 
more data for fixed computational cost. 

The efficiency of our approach resulted from adaptive truncation in domains of both 
the background and trigger intensity functions. A non-adaptive truncation was previously 
proposed in Simma (2010). There an absolute threshold r in time was applied, beyond 
which the triggering function was not evaluated, assumed to be zero. In that approach, r 
must be chosen by hand, which could be difficult in settings where little is known about the 
triggering process. Our proposal, in comparison, thresholds not only in the time domain but 
also in the space domain and the domain of the background intensity function. The resulting 
computational complexity is 0(N log N) with constant C proportional to the average number 
of events occurring in the volume [t—r, t] x B$(x) for thresholds r, S. Moreover, unlike Simma 
(2010), our methodology provides a mechanism to select r, S adaptively, by implicitly solving 
for C = L, thereby mitigating an important practical issue. The truncation parameter L is 
not a hyperparameter per se and should be chosen sufficiently large that any further increase 
in L leads to negligible variation in the estimated intensity functions, or indeed any derived 
quantities of interest. 

Our preliminary empirical investigation demonstrated inferential and predictive perfor¬ 
mance that was competitive with parametric estimators, but also revealed systematic down¬ 
ward bias in estimation of reproductive ratios R. This may be because the introduction of 
nonparametric uncertainty into the trigger function raises the evidence threshold to conclude 
that an event was triggered. Further research will be required to address this methodological 
issue; indeed, this contribution suggests a number of interesting extensions that are made 
possible by accelerated computation; (i) reformulating the proposed estimators within the 
Bayesian framework, allowing for (a) sequential updating of estimators /t, g as new data 
arrive, and (b) regularising the reproductive ratio R via a prior distribution p(R ), (ii) incor¬ 
porating observation noise into KDE, (iii) introducing latent variables to account for missing 
occurrence data, and (iv) migrating nearest neighbour computation to GPUs (Pan et al, 
2010 ). 

Whilst we focused on the popular class of processes with continuous state space A, there 
exist a number of additional techniques to reduce computational complexity in discrete state 
spaces (e.g. defined by networks); see Simma and Jordan (2010); Zhou et al. (2013) for 
details. 
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